×

👋 Hello, Anthropic team!

Thanks for checking out my blog. I'm excited about the opportunity to work with you on building safe, beneficial AI systems.

Feel free to explore the posts on AI alignment, verification theory, and software engineering.

— James

Tag: strategic-deception

Blog Posts

Why Alignment Verification Might Be Fundamentally Broken

We've known since 1936 that universal verification is impossible. Now we're trying it on AI systems that adapt to detection.

For any detector f, it is possible to construct a program g that can bypass or defeat it. Any alignment test becomes a signal that says, "Humans are watching."

System Prompt Testing Methodology

These notes are part of my experiment in "learning in public" through a semi-automated Zettelkasten. Each note is atomic (containing one core idea), heavily interconnected, and designed to evolve as my understanding deepens.

This first note tackles AI system prompt testing, but not the "did it give the right answer" kind. Traditional frameworks already handle that. Instead, this methodology tests whether an AI maintains its boundaries when someone tries to break them.

AI systems face unique attack vectors. "Ignore previous instructions" shouldn't work, yet variations slip through. Security researchers keep rediscovering the same vulnerabilities because we lack systematic approaches to behavioral testing.

The methodology covers four core dimensions: behavioral consistency, boundary enforcement, adversarial stress testing, and context degradation. Each includes concrete attack patterns—everything from simple role confusion to sophisticated prompt injections hidden in code comments.